A tidy dataset of variants of the Portuguese “Vinho Verde” wine was used for this analysis. The dataset comes from a 2009 study. It consists of physicochemical (input) variables and a sensory (output) variable. The following document was used as the source for, e.g., field descriptions: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The data includes 1,599 observations and thirteen variables.
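The listings above look like the output of `names()`, `str()` and `summary()`. A minimal sketch of that inspection step follows. The original loading code is not shown, so the data frame here is rebuilt from the first ten rows printed in the `str()` listing (a subset of columns only), and the name `wine_sample` is an assumption made for illustration.

```r
# Hypothetical stand-in for the full data: the first ten rows of a few
# columns, copied from the str() listing above.
wine_sample <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_sample)    # column names
str(wine_sample)      # storage type of each column
summary(wine_sample)  # min, quartiles, mean and max per column
dim(wine_sample)      # number of rows and columns
```

On the full dataset the same calls would report 1,599 rows and thirteen columns, as described above.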
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Fixed.acidity is measured as the concentration (in g/dm^3) of tartaric acid. Most acids in wine fall in this category. We see a fairly normal distribution here.
Volatile.acidity is the concentration (in g/dm^3) of acetic acid in the wine. Higher levels of this can lead to “an unpleasant, vinegar taste”. Again, a fairly normal distribution with a bit of a long tail, slightly bimodal.
Citric acid concentration is in g/dm^3. Apparently it “adds ‘freshness’ and flavor to wines”. This looks like a slightly positively-skewed distribution. There appear to be a number of observations with very low levels of citric acid, and spikes at 0.25 g/dm^3 and 0.50 g/dm^3. The outlier at 1.0 gives a longer tail.
Residual sugar concentration (g/dm^3) has an early spike and a very long tail. It’s the amount of sugar remaining after fermentation stops.
Taking the log of the residual sugar concentration smoothes out the distribution a bit.
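To make the effect of the transformation concrete, here is a small sketch using only the first ten residual.sugar values from the `str()` listing above. The base-10 logarithm and the `skewness()` helper are assumptions for illustration; the report does not say which base was used or how the smoothing was judged.

```r
# First ten residual.sugar values (g/dm^3) from the str() listing above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Moment-based sample skewness; a helper written for this sketch only.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

log_sugar <- log10(residual_sugar)

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log_sugar)
c(raw = skew_raw, log10 = skew_log)  # the log10 value is the smaller one here

# Bin counts without drawing; drop plot = FALSE to see the histogram.
hist(log_sugar, plot = FALSE)$counts
```

A smaller positive skewness after the transformation corresponds to the shorter right tail seen in the log-scaled plot.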
Chlorides represent the amount of salt in the wine, measured as the concentration of sodium chloride (in g/dm^3). There is a spike of chlorides and a long tail of outliers.
Transforming the value via a log, we see more of a spike with outliers than a bell curve.
A positively-skewed distribution of free.sulfur.dioxide (in mg/dm^3) with a long tail is observed. Free.sulfur.dioxide prevents microbial growth and wine oxidation.
The total amount of sulfur dioxide includes free and bound forms of SO2 (in mg/dm^3). At concentrations of free SO2 over 50 ppm, SO2 becomes evident in the taste and nose of a wine. Here we see a positively-skewed distribution.
Taking the log, we see a more-normal distribution of the total.sulfur.dioxide.
Without transformation, we see a nice bell curve for density. Density here is in (g/cm^3). Apparently the density depends a bit on the percent of alcohol and sugar content.
We see a normal distribution of wines by pH. Not being previously familiar with the pH of wine, I was surprised to see it so acidic (neutral water has a pH of 7).
Sulphates measures the concentration of potassium sulphate (in g/dm^3). It’s an additive that can contribute to SO2 gas levels, and acts as an antimicrobial and antioxidant.
Sulphates are a bit closer to a normal distribution with a log scale applied.
Alcohol (in % by volume) is positively skewed.
Alcohol maintains its positive skew even with a log transformation.
Finally we get to quality scores (on a scale of 0 to 10). Overwhelmingly, most wines received a 5 or 6. Strikingly, the extremes (0, 1, 2, 9, and 10) are not represented at all.
The dataset for red wine is in a tidy format, with each observation of a particular wine on its own row. The dataframe is wide, with a separate column for each variable.
The ‘X’ column is an id stored as an integer, and Quality (the “output variable”) is also stored as an integer value. All other columns (“input variables”) contain measurements stored as double precision floating point numbers.
The main feature of interest in this dataset is quality. Quality is the single “output” variable that we can try to determine using the various “input” variables. Strikingly, though quality is supposed to be rated on a scale of 0-10, there are no observed values of 0, 1, 2, 9, or 10 in the dataset. Most values are in the middle (5, 6, or 7), and there are a few observations of 3, 4, and 8.
Though potentially any of the “input variables” could help us understand the quality “output variable”, the notes suggest we may see either positive or negative correlations between certain inputs and quality: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
In particular, we can look at these:
- volatile acidity: when too high, it can lead to an “unpleasant, vinegar taste”.
- citric acid: can add “‘freshness’ and flavor”.
- total sulfur dioxide: “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”.
Later we see that 3 groupings were made for quality (rather than using the 0-10 scale present), but other than that, no new variables were created from the dataset (except to create a subset of data for correlations that didn’t include the observation identifier).
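The two derived objects mentioned here (a three-level quality grouping and a subset without the identifier) could be built roughly as follows. This is a sketch under stated assumptions: the break points, the group labels, and the names `wine_sample`, `quality.group` and `wine_numeric` are all hypothetical, and the data are just the first ten rows from the `str()` listing above.

```r
# First ten rows of three columns, copied from the str() listing above.
wine_sample <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed break points: [0, 4], (4, 6], (6, 10]. Labels are hypothetical.
wine_sample$quality.group <- cut(
  wine_sample$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "average", "high"),
  include.lowest = TRUE
)
table(wine_sample$quality.group)

# Drop the observation identifier (and the factor) before correlating.
wine_numeric <- subset(wine_sample, select = -c(X, quality.group))
cor(wine_numeric)
```

With right-closed intervals a score of 6 lands in the middle group; a different choice of break points would move the boundary scores into other groups.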
Alcohol has an interesting non-normal distribution that wasn’t much affected by log transformations. Some plots were adjusted to exclude outliers, but generally few operations were performed to change the form of the data.
## cor
## -0.3905578
We see that (as predicted) lower mean volatile.acidity is correlated with higher wine quality. This is in line with the general expectation that volatile.acidity is “unpleasant” at higher levels. The orange line represents the mean. The blue dotted lines represent quantiles of 10%, 50% (the median), and 90%. This pattern will be repeated below.
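The plotting code is not shown, so the sketch below only computes the summaries that the lines represent: the mean and the 10%, 50% and 90% quantiles of volatile.acidity at each quality level. It uses the first ten rows from the `str()` listing above, and the object names are assumptions.

```r
# First ten values of each column, copied from the str() listing above.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One vector of acidity values per observed quality level.
by_quality <- split(volatile_acidity, quality)

# Mean per quality level (the orange line in the plot).
group_means <- sapply(by_quality, mean)

# 10%, 50% and 90% quantiles per level (the blue dotted lines).
# Result: one row per quantile, one column per quality level.
group_quantiles <- sapply(by_quality, quantile, probs = c(0.1, 0.5, 0.9))

group_means
group_quantiles
```

With only ten rows some levels contain a single wine, so their quantiles collapse to one value; the full dataset gives the spread seen in the plot.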
## cor
## 0.2263725
We see that (as predicted) higher citric acid is weakly correlated with higher wine quality.
## cor
## -0.1851003
We see a small negative correlation between total.sulfur.dioxide and quality, in line with expectations.
## cor
## 0.4761663
The strongest correlation for quality was between % Alcohol Content and Quality with a Pearson’s r of 0.48. The mean value of alcohol % increases with quality.
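The `cor` values printed throughout this section have the shape of the `estimate` element returned by `cor.test()`. Assuming that is how they were produced (the original call is not shown), a minimal sketch on the first ten rows from the `str()` listing looks like this:

```r
# First ten values of each column, copied from the str() listing above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation test between the two variables.
result <- cor.test(alcohol, quality, method = "pearson")

result$estimate  # Pearson's r, printed under the name "cor"
result$p.value   # p-value for the null hypothesis r = 0
```

The value from ten rows will not match the 0.48 reported for all 1,599 observations; the sketch only shows the mechanics.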
Since the correlation between Alcohol and Quality was strongest, alcohol was examined first. Its strongest correlation with another input variable appeared to be with density.
## cor
## -0.4961798
A negative correlation (r = -.50) was found between percent of alcohol and density. Thus, the more alcohol, the lower the density.
## cor
## -0.5524957
There was a moderately strong negative correlation (r = -.55) between volatile acidity and citric acid.
## cor
## 0.6676665
There was a strong positive correlation (r = .67) between free and total sulfur dioxide, though this was to be expected since total sulfur dioxide includes the free form.
## cor
## 0.6680473
Another strong positive correlation (r = .67) was found between density and fixed acidity.
## cor
## -0.6829782
A strong negative correlation (r = -.68) in the data was found between pH and Fixed Acidity. This makes sense, as a lower pH indicates higher acidity.
To summarize, correlations were discovered between Quality (the feature of interest) and:
- alcohol (r = .48) (strongest positive correlation)
- citric.acid (r = .23)
- volatile.acidity (r = -.39) (strongest negative correlation)
- total.sulfur.dioxide (r = -.19)
Generally each of these was expected, except for the strong correlation between alcohol content and quality.
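Correlations were also found between:
- alcohol and density (r = -.50)
- volatile.acidity and citric.acid (r = -.55)
- free.sulfur.dioxide and total.sulfur.dioxide (r = .67)
- density and fixed.acidity (r = .67)
- pH and fixed.acidity (r = -.68)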
The strongest correlation in the data (r = -.68) was the negative correlation found between pH and Fixed Acidity. This makes sense, as lower pH values indicate higher acidity.
Three variables are plotted here – analysis is below.
Here the only difference from plotting individual quality levels is that quality was plotted in distinct groups. The aim was for the groupings to “pop out” a bit more. Higher qualities (the 6-10 bucket) tend to appear to the lower right of the graph. The 0-4 qualities tend to appear to the upper left, and the average 4-6 qualities are found in between. The overall finding is that higher percent alcohol and lower volatile acidity tend to be associated with higher-rated wine.
Three variables are again plotted here – analysis is below.
Here the quality bins are again used to show how citric acid and percent alcohol relate to quality. The high quality grouping lies to the upper right, and the low quality grouping lies to the lower left (with a high variance in this case). Thus, we see again that higher citric acid levels and higher percent alcohol are both generally correlated with higher quality ratings. Also of note is that there are many observations with no citric acid at all.
Analysis is below.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + total.sulfur.dioxide,
## data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + total.sulfur.dioxide +
## density, data = wine)
##
## =====================================================================
## m1 m2 m3 m4
## ---------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.305*** -16.936
## (0.175) (0.184) (0.192) (10.264)
## alcohol 0.361*** 0.314*** 0.302*** 0.320***
## (0.017) (0.016) (0.016) (0.019)
## volatile.acidity -1.384*** -1.371*** -1.353***
## (0.095) (0.095) (0.095)
## total.sulfur.dioxide -0.002*** -0.002***
## (0.001) (0.001)
## density 20.103*
## (10.192)
## ---------------------------------------------------------------------
## R-squared 0.227 0.317 0.323 0.325
## adj. R-squared 0.226 0.316 0.322 0.323
## sigma 0.710 0.668 0.665 0.664
## F 468.267 370.379 253.797 191.665
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1614.623 -1612.674
## Deviance 805.870 711.796 705.423 703.706
## AIC 3448.114 3251.628 3239.246 3237.349
## BIC 3464.245 3273.136 3266.132 3269.611
## N 1599 1599 1599 1599
## =====================================================================
Looking at the plot of “Alcohol vs Volatile Acidity vs Quality”, it’s evident that the higher quality wines tend to fall to the lower right of the graph and the lower quality wines fall to the upper left. Thus higher alcohol content and lower volatile.acidity are associated with higher-quality wines.
Also plotted was “Alcohol vs Citric Acid vs Quality”, where the higher quality wines tended toward the upper right quadrant and the lower quality wines fell in the lower left, in line with expectations.
Finally, the plot of “Free Sulfur Dioxide vs Total Sulfur Dioxide vs Quality Rating” gives more detail on the relationship between total sulfur dioxide and free sulfur dioxide (r = .67). The plot reveals no clear relationship between them and quality, as we observe great variance in where the quality groups fall. This result was not unexpected, but it was interesting to see what the plot looked like.
As an exercise, a basic linear model was created with alcohol, volatile.acidity, total.sulfur.dioxide, and density as inputs. It did not perform very well, as its R-squared value was just 0.325. Oddly, adding citric.acid (r = .23) as an input didn’t appear to improve the results in spite of its correlation with quality. Adding the logarithm of total.sulfur.dioxide was also attempted, but it did not improve the results either.
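The nested models in the table above follow the `lm()` calls listed under "Calls:". A reduced sketch of the first two models, fitted to the first ten rows from the `str()` listing, shows how the R-squared values can be compared. The name `wine_sample` is an assumption, and ten rows are far too few for meaningful estimates, so the numbers will not match the table.

```r
# First ten rows of three columns, copied from the str() listing above.
wine_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same formulas as m1 and m2 in the table, fitted to the sample rows.
m1 <- lm(quality ~ alcohol, data = wine_sample)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = wine_sample)

# R-squared never decreases when a predictor is added to a nested model.
r_squared <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r_squared

# AIC penalizes extra parameters, so it is a fairer comparison.
AIC(m1, m2)
```

This is also why the small R-squared gains from m2 to m4 in the table should be read alongside adjusted R-squared, AIC and BIC rather than on their own.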
The previous analysis of this graph will not be repeated here (see above for the previous description). Additionally, the trend lines show that for low quality wines (tending to appear to the upper left), volatile acidity tends to increase with an increased percent of alcohol, while the trend is slightly reversed for average quality wines and quite flat for high quality wines (tending to the lower right).
Adding to the previous analysis of this graph (see above), we see (via the trend lines) a decrease in citric acid concentration with an increase in alcohol percentage for the low and high quality wines. Oddly, there is little effect for the average quality wines. The trend lines clearly distinguish the high quality wines, with their higher citric acid, from the low quality wines, though average quality wines tend to almost converge with high quality wines at a level of 14% alcohol concentration.
Again, past analysis will not be revisited here, but it can be clearly seen that the slope of the relationship between free sulfur dioxide and total sulfur dioxide is consistent for all three quality trendlines, thus giving further evidence that we’re seeing a relationship between dependent variables, i.e., two variables that hold a similar relationship to each other regardless of quality.
Many of the data revelations in this study came from the correlation matrix between variables. The data plots largely confirmed that the correlations and distributions were valid, and revealed additional data variances and outliers. It was interesting to see that though, e.g., density had a relatively high correlation with alcohol (r = -.50) and alcohol had a high correlation with quality (r = .48), density (and other influencing variables) did not have a high correlation with quality (r = -.18). The linear model did not work out as well as would have been ideal, as an R-squared of 0.325 has limited predictive utility.
Revealing plots were created for highly-correlated variables, such as alcohol, volatile acidity and quality. Adding trendlines to the graph proved fruitful, further revealing the clear distinctions among different quality levels.
While the trendlines in the graph “Multivariate: Alcohol vs Citric Acid vs Quality” were complex, grouping qualities in the graph “Alcohol vs Citric Acid vs Quality Rating” successfully made the trends clearer.
I struggled for quite a while, researching how to adjust the font size of the correlation matrix to make it readable. It was surprisingly complex to get a readable plot. The ggcorr function produced a more usable heatmap/correlation matrix, though there were still some issues with the text (to the lower left).
I also struggled a bit with color palettes and using the factor() function to enable proper distinct plotting of trendlines.
The sources listed below indicate other areas where internet research was used to implement fixes.
As for future work, it would be useful to have a richer dataset to test with, e.g., the type of grape, the geographical location of the vineyard, the vintage, the type of cask used for storage, the label, and which reviewer gave each review, since reviewer A could have different tastes than reviewer B.
Boxplot with 1 axis: https://stackoverflow.com/a/40700387/234975
Smaller text for correlation matrix: https://stackoverflow.com/a/39716408/234975
Legend text manipulation: https://stackoverflow.com/a/38938781/234975
Color palette setting: https://books.google.fr/books?id=_iVFgKTRYrQC&lpg=PA74&ots=XY9TTXtbgC&dq=ggplot%20%22smaller%20points%22&pg=PA77#v=onepage&q=ggplot%20%22smaller%20points%22&f=false
Horizontal / vertical lines: http://www.cookbook-r.com/Graphs/Lines_(ggplot2)/
Log ticks: http://ggplot2.tidyverse.org/reference/annotation_logticks.html
Background color: https://stackoverflow.com/a/6736412/234975